Domination based classification algorithms for the controllability analysis of biological interaction networks

Deciding the size of a minimum dominating set is a classic NP-complete problem. It has found increasing utility as the basis for classifying vertices in networks derived from protein–protein, noncoding RNA, metabolic, and other biological interaction data. In this context it can be helpful, for example, to identify those vertices that must be present in any minimum solution. Current classification methods, however, can require solving as many instances as there are vertices, rendering them computationally prohibitive in many applications. In an effort to address this shortcoming, new classification algorithms are derived and tested for efficiency and effectiveness. Results of performance comparisons on real-world biological networks are reported.

Let G = <V, E> denote a finite, simple, undirected graph of order n. A dominating set for G is a subset D of V with the property that every vertex in V − D has a neighbor in D. A minimum dominating set (MDS) is of course one of smallest cardinality. Its size is usually denoted by γ(G). Deciding MDS is both NP-complete 1 and W[2]-complete 2. It is easy to see that G may have as many as 3^(n/3) distinct MDS solutions, as is demonstrated by the union of n/3 disjoint triangles. A common strategy is therefore to concentrate on significance and classify a vertex as “essential” (aka “critical”) if it is used in every MDS, as “intermittent” if it is used in some but not every MDS, and as “redundant” if


Preliminaries
Notation. Let u and v denote elements of V. The distance between u and v is the number of edges in a shortest path between them. The neighborhood of u, denoted by N[u], comprises u and its neighbors or, equivalently, those vertices within distance one from u. (This is sometimes called the closed neighborhood of u, in order to distinguish it from the open neighborhood N[u] − {u}.) Neighborhoods are extended to sets in a straightforward fashion. Thus, for a set S of vertices, N[S] denotes S and the neighbors of all its elements. An orbit is an equivalence class of vertices under the action of an automorphism group. That is, u and v belong to the same orbit if and only if there exists a relabeling of V that results in an isomorphic graph for which u and v have exchanged labels 22. Finally, given an MDS, D, we say that u dominates v whenever u and v are adjacent, and u but not v is an element of D.
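The notation above can be made concrete with a minimal Python sketch. It assumes a graph stored as a dictionary that maps each vertex to the set of its neighbors; the function names are illustrative only and do not come from the original work. The brute-force enumeration is exponential and is meant only for tiny graphs, here the union of two disjoint triangles.

```python
from itertools import combinations

def closed_neighborhood(adj, u):
    """N[u]: the vertex u together with its neighbors."""
    return {u} | adj[u]

def closed_neighborhood_of_set(adj, S):
    """N[S]: the set S together with the neighbors of all its elements."""
    result = set(S)
    for u in S:
        result |= adj[u]
    return result

def is_dominating(adj, D):
    """True if every vertex is in D or has a neighbor in D, that is, N[D] = V."""
    return closed_neighborhood_of_set(adj, D) == set(adj)

def all_minimum_dominating_sets(adj):
    """Enumerate every MDS by brute force (illustration only, exponential time)."""
    vertices = sorted(adj)
    for k in range(len(vertices) + 1):
        found = [set(c) for c in combinations(vertices, k) if is_dominating(adj, c)]
        if found:
            return found
    return []

# Two disjoint triangles: n = 6, so 3^(n/3) = 9 distinct MDS solutions are expected.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
                 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
solutions = all_minimum_dominating_sets(two_triangles)
print(len(solutions[0]), len(solutions))  # prints: 2 9
```

Each triangle contributes one vertex to any MDS, and the choices are independent, which is why the count multiplies to 3 × 3.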
Prior work.e vertex classication problem has been studied 20,21 using the two previously-mentioned observations coupled with an MDS algorithm that employs Integer Linear Programming (ILP).Despite the fact that known ILP methods can in principle require exponential time, a major appeal of this approach relies on the existence of powerful commercial ILP solvers that tend to work extremely well in practice.us, once an initial MDS, D, has been computed, one needs only to consider each vertex, u, in turn.
• If u ∈ D, then construct an ILP instance of MDS with a constraint to exclude u. We refer to the resultant procedure as ILP-exclude, with parameters G and u. If γ(ILP-exclude(G,u)) exceeds γ(G), then u is essential, otherwise it is intermittent.
• And if u ∉ D, then construct an ILP instance of MDS with a constraint to include u. We refer to the resultant procedure as ILP-include, also with parameters G and u. If γ(ILP-include(G,u)) exceeds γ(G), then u is redundant, otherwise it is intermittent.
Classifier A. For the sake of clarity and exposition, and to help explicate algorithmic comparisons, this procedure (previously unnamed in 20,21) is presented here in pidgin code and dubbed Classifier A. We note that the exploitation of pendant vertices can be employed at start-up, while the examination of neighbors is best applied only after all essential vertices have been identified.
Classifier A requires low-order polynomial time to initialize C and R (an exact upper bound depends on graph density and the data structures used), exponential time for a call to an ILP solver to answer a single instance of MDS, and time for at most n exponential-time calls to ILP-exclude/include. Classifier A's needs for extra space are negligible.
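The exclude/include logic behind Classifier A can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pidgin code: a brute-force search stands in for the ILP solver, the pendant-vertex and neighbor shortcuts are omitted, and the names `min_dominating_set` and `classify_a` are hypothetical.

```python
from itertools import combinations

def min_dominating_set(adj, include=frozenset(), exclude=frozenset()):
    """Smallest dominating set that contains `include` and avoids `exclude`.

    Brute force replaces the ILP solver here (assumption for illustration).
    Returns None when no such dominating set exists.
    """
    vertices = set(adj)
    free = sorted(vertices - set(include) - set(exclude))
    for k in range(len(free) + 1):
        for extra in combinations(free, k):
            D = set(include) | set(extra)
            covered = set(D)
            for u in D:
                covered |= adj[u]
            if covered == vertices:
                return D
    return None

def classify_a(adj):
    """Label every vertex using one MDS call plus one constrained call per vertex."""
    D = min_dominating_set(adj)
    gamma = len(D)
    labels = {}
    for u in adj:
        if u in D:
            # Counterpart of ILP-exclude: forbid u and compare sizes.
            alt = min_dominating_set(adj, exclude={u})
            worse = alt is None or len(alt) > gamma
            labels[u] = "essential" if worse else "intermittent"
        else:
            # Counterpart of ILP-include: force u and compare sizes.
            alt = min_dominating_set(adj, include={u})
            labels[u] = "redundant" if len(alt) > gamma else "intermittent"
    return labels

path = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(classify_a(path))
print(classify_a(triangle))
```

On the three-vertex path the center is essential and both end vertices are redundant, whereas every vertex of the triangle is intermittent. The sketch makes n + 1 solver calls in total, which is the cost the later rules aim to reduce.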

In search of a better classifier
Classification rules. Classifier A's most time-consuming operations are its multitude of calls to ILP-exclude/include. We therefore propose, scrutinize, and employ a series of pre-processing rules in hopes that we can reduce the total number of calls required, thereby increasing the scalability of MDS-based biological network analytics.
Rule 1. Suppose u and v are adjacent, and the neighborhood of u is a proper subset of the neighborhood of v. If v is essential, then u is redundant.
Soundness. If an MDS contains v, then it cannot contain u, since otherwise the MDS would not be minimum. Thus, if every MDS contains v, then none can contain u. (Note the need for proper containment. If N[u] = N[v], then neither u nor v can be essential, and both must be redundant or both intermittent.)
Rule 2. If u is not essential, and if every element in u's neighborhood is either essential or adjacent to an essential vertex, then u is redundant.
Soundness.is is a generalization of Rule 1, in which vertices in the neighborhood of u may be dominated by more than just a single essential vertex.
Rule 3. Suppose u but not v is contained in an MDS for which those vertices dominated only by u are in the neighborhood of v. Then both u and v are intermittent.
Classifier B. We make use of these four rules in a procedure we name Classifier B. This new classifier need not invoke Classifier A, because the aforementioned observations upon which Classifier A relies are subsumed by Rules 2 and 4. On the other hand, the order in which rules are applied by Classifier B is important if we are to avoid calling MDS multiple times. Bolstered by Rules 1-4, it should come as no surprise that Classifier B provides a considerable improvement over Classifier A. We will demonstrate this convincingly in the sequel. But first we consider the possible utility of a more computationally demanding rule.
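One conservative way to exercise Rule 3 is to test the swap directly: given an MDS, replace a member u with a non-member v and check whether the result still dominates. The Python sketch below does this instead of checking the neighborhood condition verbatim, so it is an assumption-laden illustration rather than the authors' procedure. It assumes that `mds` really is a minimum dominating set; the function names are hypothetical.

```python
def dominates(adj, D):
    """True if every vertex is in D or adjacent to a member of D."""
    covered = set(D)
    for u in D:
        covered |= adj[u]
    return covered == set(adj)

def rule_3_intermittent(adj, mds):
    """Vertices shown to be intermittent by a single-vertex swap in `mds`.

    If (mds - {u}) | {v} still dominates, it is another MDS of the same size,
    so u and v each appear in some MDS but not in every MDS.
    """
    mds = set(mds)
    outside = set(adj) - mds
    intermittent = set()
    for u in mds:
        for v in outside:
            if dominates(adj, (mds - {u}) | {v}):
                intermittent.update((u, v))
    return intermittent

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
print(rule_3_intermittent(triangle, {0}))  # every vertex can be swapped in
print(rule_3_intermittent(path3, {1}))     # no swap works: the center is essential
```

No solver call is needed for this check, which is the point of applying it before falling back on ILP-exclude/include.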

The use of algebraic symmetry
Orbits and automorphisms. In an effort to provide additional reductions in the number of ILP-exclude/include calls required, we turn to notions of graph structure, neighborhood symmetry, and adjacency-preserving vertex permutations.
Rule 5. If V is partitioned into a set of vertex orbits, then vertices within the same orbit must possess the same classification.
Soundness. Vertices within the same orbit are indistinguishable under automorphic transformation, and so their classifications will be identical.
Classifier C. We therefore study yet a third procedure, which we christen Classifier C. This new classifier operates as does Classifier B, except that it incorporates Rule 5 by first computing all orbits and then, whenever a vertex is classified, any unclassified vertices in its orbit are assigned the same classification.
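The orbit bookkeeping that Rule 5 calls for can be sketched in Python with a union-find structure. The sketch assumes that orbit information arrives as pairs of vertices known to share an orbit, a hypothetical input format chosen for illustration; it does not call any automorphism package, and the function names are not taken from the original work.

```python
def merge_orbit_pairs(vertices, pairs):
    """Merge same-orbit vertex pairs into complete orbits using union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    orbits = {}
    for v in vertices:
        orbits.setdefault(find(v), set()).add(v)
    return list(orbits.values())

def propagate_labels(orbits, labels):
    """Rule 5: copy a known classification to unclassified orbit members."""
    out = dict(labels)
    for orbit in orbits:
        known = [out[v] for v in orbit if v in out]
        if known:
            for v in orbit:
                out.setdefault(v, known[0])
    return out

# Star with center 0: the three leaves form a single orbit.
orbits = merge_orbit_pairs([0, 1, 2, 3], [(1, 2), (2, 3)])
labels = propagate_labels(orbits, {1: "redundant"})
print(sorted(sorted(o) for o in orbits))  # prints: [[0], [1, 2, 3]]
print(labels)
```

Once orbits are known, classifying one leaf settles the other two without further solver calls, which is the saving Rule 5 is intended to deliver.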
Classifier C, like Classifier B, requires low-order polynomial time to apply Rules 1-4, exponential time to solve a single instance of MDS, and time for at most n exponential-time calls to ILP-exclude/include. Classifier C also needs low-order polynomial time to update orbit classifications. More significantly, it requires exponential time to determine the orbits themselves with known practical methods 23. These orbits can be found using bliss 24, nauty 25, and a variety of other popular, well documented, easy-to-use tools. From these we chose saucy 26,27, by virtue of the fact that it has been tuned for sparse graphs, which are overwhelmingly representative of large-scale biological data. And indeed, saucy was roughly 10-20 times faster than bliss and over 1000 times faster than nauty across our test suite. We hasten to add, however, that saucy requires a bit more effort to implement than does nauty or bliss. This is because saucy only returns vertex pairs that occupy the same orbit. The user must then merge these pairs to form a complete orbit set.
... which runs in O(1.4864^n) time and polynomial space. We were careful to avoid reproducibility problems that might arise from complex parameter settings. Our classifiers take as input only finite simple graphs, while default settings were strictly obeyed for Gurobi.
In order to provide empirical comparisons at scale, all tests were executed on the Advanced Computing Facility (ACF) computational cluster maintained by the National Institute for Computational Sciences 32. Timings were performed on a single core of ACF's monster (big memory) node using a Dell PowerEdge R630 server, an Intel Xeon E5-2687W v4 3.00 GHz processor with 30 MB Intel Smart Cache, 1,024 GB DDR4 memory, and ACF's read/write Network File System.
Table 1. A test suite of real-world biological graphs. Types are CI (chromatin interaction), GC (gene co-expression), GFA (gene functional association), PPI (protein-protein interaction), and M (miscellaneous), where graph 32 is derived from biological functionality data, graph 33 is derived from drug-drug interactions, graph 34 is derived from human gene signaling and regulatory pathway interactions, and graphs 35 and 36 are derived from neuron connections in the fly medulla and in the mouse retina, respectively.
Three dozen challenging graphs were assembled to form a comprehensive classifier test suite. Graphs that populate this suite were obtained from well-known repositories and derived from transcriptomic, proteomic, epigenetic, and a variety of other sorts of biological data. We excluded from this suite any graph on which a classifier failed to finish within 24 h, which generally seemed to result from exceptional size or, less frequently, from unusual density. Graphs thus selected are described in Table 1. Runtimes per instance and classifier are displayed in Table 2.

Empirical results. We first studied preprocessing, with success measured as a percentage of vertices classified without an ILP-exclude/include call. Over our test suite, Classifier A had an average success rate of only 14.1%. In contrast, Classifier B had an average success rate of 67.2%, while Classifier C had an average success rate of 72.5%. As expected, Rules 1-5 thus seem to place Classifiers B and C at an enormous computational advantage. See Fig. 5.
We then turned to overall processing times. Unsurprisingly, we found that Classifier A was simply not competitive. Its meager preprocessing success rate placed too great a burden on mathematical optimization software. The computational demands of Rule 5, however, posed a pivotal question: is Classifier C's modest reduction in ILP-exclude/include invocations a smart investment? In other words, do Classifier C's time-consuming orbit computations translate into runtimes that are better than those of Classifier B? The answer is hardly obvious. Even with a leading-edge graph automorphism package such as saucy, it can be exceedingly difficult to compete against ILP computations performed by a well-honed commercial product like Gurobi. Because runtimes varied greatly over the graphs in our test suite, we normalized all completion times to that of Classifier A. Resultant calculations revealed that, on average, Classifiers B and C were more or less in a dead heat. Classifier B took roughly 38.2% as long as Classifier A, while Classifier C took some 37.9% as long. Thus, under these experimental conditions, the overall impact made by adding Rule 5 was positive but barely noticeable. See Fig. 6.
It is difficult from these results to argue against the use of either Classifier B or Classifier C. Both are vastly more effective than Classifier A. And while Classifier B is the simpler of the two, Classifier C was able to eke out a slight gain in speed. Having said that, we must remember that this endorsement is dependent on both our test suite and the computational resources available. Classifiers B and C were highly competitive. Different datasets, alternative applications, or a change in automorphism software may cause the added overhead and complexity of Classifier C to have a much greater effect, either positive or negative, than was observed here. These experiments in fact prompt a few serendipitous dataset observations, which we will discuss in the final section.

Discussion
Conclusions. Major contributions of this paper include the development, analysis, implementation, and testing of five novel classification rules and two highly innovative classifier algorithms with which vertex significance can be gauged in a network domination setting. Extensive empirical evidence of the practical usefulness of these powerful new rules and classifiers was also generated using a comprehensive test suite centering on life science applications and biological data.
Classifiers B and C turn out to be huge improvements over Classifier A in terms of both preprocessing rates and overall runtimes. Their relative effectiveness would have been even more pronounced had we not had access to a commercial ILP solver with the exceptional efficiency of Gurobi. Results from our extensive test suite suggest that Classifiers B and C are very nearly equal in performance. Although Classifier C was faster by a narrow margin, users may wish to give Classifier B a slight nod for its comparative simplicity.
Patterns seen in results and data may be of additional interest. We observe, for example, the modest MDS size of chromatin interaction data (test graphs 1-9). Concomitantly, these are the only graphs for which the preprocessing performed by Classifier C is significantly better than that of Classifier B. It seems plausible that this rather curious situation might be attributable to graph density, but most biological data is sparse, and indeed these graphs are roughly as sparse as all others in our test suite. We therefore turned to degree distributions and found that the chromatin interaction histograms appear normalesque and not scale-free like histograms for the rest of our test suite. Whether this is causative is unknown. We found it interesting too that all classifiers were unusually successful in preprocessing graph 25 (bio-grid-worm). Upon investigation, we discovered that this graph has an extremely high number of redundant vertices. Whether this attribute relates to better preprocessing is unclear. And finally, graph 36 (bn-mouse-retina-1) caught our attention because it was especially difficult for all classifiers, and yet its MDS is about the same size as those of the chromatin interaction graphs. Other than idiosyncrasies of data capture (neuronal connections imaged by electron microscopy), we can posit no particular basis for its computational recalcitrance.
Directions for future research. The rules we have devised assign a single MDS classification to any vertex. It is sometimes possible, however, to eliminate one classification option, making it reasonable to envisage more convoluted rules that assign a pair of classification choices to some vertices. As we have seen with Rule 5, however, the overhead and complexity of such a strategy must not be so high that it negates any meaningful gains.
MDS vertex classifications may find additional utility among problem variants. The study of independent dominating set, for instance, is a restatement of maximal independent set, and can be traced back roughly 60 years 33. Other classic examples include connected dominating set 34 and total dominating set 35. Vertex classification strategies may also be of interest when data is drawn from reduced graph families. Limiting inputs to planar graphs, for example, is a popular restriction in circuit layout and many other engineering applications, although in our opinion this sort of limitation would be difficult to motivate from a biological perspective.
It might also be instructive to consider the relationship between orbit distributions and graph structure. For example, those who embrace the once-popular scale-free hypothesis 36 might predict that orbits would be found primarily among leaves that share a common neighbor. As a simple test, we therefore scanned the non-singleton orbit lists and computed the percentage of these lists that contained non-leaf vertices for each graph in our test suite. These values turned out to range more or less uniformly between 4 and 100%. Unsurprisingly, it thus appears that the utility of automorphic transformation is highly data dependent, and that the extent to which Rule 5 applies is primarily a function of the particular graph under examination. This would seem to suggest that the relationship between orbits and the topology of graphs derived from biological data might warrant future study.
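The simple orbit test described above can be sketched in Python as follows. The sketch assumes that orbits have already been computed and are supplied as a list of vertex sets, and it takes a leaf to be a vertex of degree one; the function name is hypothetical.

```python
def percent_orbits_with_non_leaves(adj, orbits):
    """Percentage of non-singleton orbits containing a vertex of degree > 1.

    Returns None when the graph has no non-singleton orbit.
    """
    non_singleton = [o for o in orbits if len(o) > 1]
    if not non_singleton:
        return None
    hits = sum(1 for o in non_singleton if any(len(adj[v]) > 1 for v in o))
    return 100.0 * hits / len(non_singleton)

# Path on four vertices: the orbits are the two ends and the two middle vertices.
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(percent_orbits_with_non_leaves(path4, [{0, 3}, {1, 2}]))  # prints: 50.0

# Star: the only non-singleton orbit consists entirely of leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(percent_orbits_with_non_leaves(star, [{0}, {1, 2, 3}]))   # prints: 0.0
```

Because automorphisms preserve degree, every vertex in an orbit has the same degree, so testing any one member of an orbit would give the same answer.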
Finally, while our focus has been on practical applications, numerous theoretical questions beckon. We think it highly probable, for example, that classification strategies such as those we have developed here may prove useful for combinatorial problems other than MDS. Rule 5, in particular, seems to have something of a universal appeal. Another good example rests with worst-case classifier behavior. Each method we have considered could in principle invoke an MDS solver as many as n + 1 times. Classifier A in fact did exactly this, for instance, on test graph 5 (HiC-Net-10). Classifiers B and C, on the other hand, never even came close to this sort of pathology. We think it is highly unlikely that real-world biological data of sufficient size would cause either of these classifiers to be so completely ineffective. To the best of our knowledge, however, the sort of worst-case performance that might be attained with highly contrived data remains unknown.

Soundness (Rule 3). Replacing u with v produces a distinct but equivalent MDS.
Rule 4. If u has neighbors v and w whose only common neighbor is u and for which (N[N[v]] ∪ N[N[w]]) ⊆ N[u], then u is essential.
Soundness. Because N[v] ∩ N[w] = {u}, and because u dominates every vertex in N[N[v]] ∪ N[N[w]], it follows that u is required in any MDS, since otherwise at least two vertices from N[v] ∪ N[w] would be required in its place to dominate v and w.
These rules require only neighborhood explorations, and are thus amenable to illustration. Sample subgraph configurations are depicted in Figs. 1, 2, 3 and 4.
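The neighborhood test in Rule 4 can be sketched in Python as shown here. The sketch assumes an adjacency dictionary of neighbor sets and reads the containment as non-strict, since N[u] is always contained in N[N[v]] whenever v is a neighbor of u. The function name is hypothetical.

```python
from itertools import combinations

def rule_4_essential(adj):
    """Vertices that Rule 4 certifies as essential."""
    closed = {u: {u} | adj[u] for u in adj}

    def second_neighborhood(v):
        # N[N[v]]: closed neighborhood of the closed neighborhood of v.
        out = set()
        for x in closed[v]:
            out |= closed[x]
        return out

    essential = set()
    for u in adj:
        for v, w in combinations(adj[u], 2):
            only_common = closed[v] & closed[w] == {u}
            contained = (second_neighborhood(v) | second_neighborhood(w)) <= closed[u]
            if only_common and contained:
                essential.add(u)
                break
    return essential

path3 = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(rule_4_essential(path3))     # prints: {1}
print(rule_4_essential(triangle))  # prints: set()
```

On the three-vertex path the two end vertices share only the center, and everything within distance two of them lies in the center's neighborhood, so the center is certified. In the triangle any two neighbors are adjacent to each other, so the common-neighbor condition fails.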

Figure 1. A sample subgraph subject to Rule 1. If v is essential, then u is redundant.

Figure 2. A sample subgraph subject to Rule 2. If every element in the set {u, v, w, x, y} is either essential or adjacent to an essential vertex, then u is either essential or redundant.

Figure 3. A sample subgraph subject to Rule 3. If u but not v is contained in an MDS, then v but not u is contained in some other MDS, and so both vertices are intermittent.

Figure 4. A sample subgraph subject to Rule 4. Vertex u must be essential.

Figure 5. Percent of vertices classified without ILP-exclude/include calls by Classifiers A (in green), B (in red), and C (in blue). Dashed lines represent averages, which were 14.1%, 67.2%, and 72.5% for Classifiers A, B, and C, respectively.

Figure 6. Overall runtimes of Classifiers B (in red) and C (in blue), normalized to that of Classifier A (in green). Dashed lines are almost collinear and represent averages, which were 38.2% and 37.9% for Classifiers B and C, respectively.
Classifier comparisons
Computational milieu. Classifiers A, B, and C were implemented in C++ and compiled using the g++ (GCC) version 4.8.5 compiler under the CentOS Linux 7 x86-64 operating system. Various mathematical optimization software packages were considered, including notable options such as CPLEX 28 and Xpress 29. From these we chose Gurobi 30 for our ILP solver. It is a hugely successful, widely used, state-of-the-art commercial product. Moreover, Gurobi is freely available to many in the research community via academic site license. As in previous work, we used ILP to satisfy each classifier's initial MDS requirement. Possible alternatives include the measure and conquer method of ...
Scientific Reports | (2022) 12:11897 | https://doi.org/10.1038/s41598-022-15464-4

Table 2. Run times for each test suite instance and each classifier, measured in seconds.